1. Introduction

In this data analysis project we will focus on the U.S. economy and its interrelationship with the realty market. We will analyze data from various economic indicators to determine the health of the economy. This is also a personal interest of mine: I would like to better understand the U.S. economy, why certain factors are the way they are, and perhaps predict future economic conditions. As I reside in Hawaii, I would also like to improve my understanding of the local economy of the state of Hawaii and how it relates to the local realty market in the islands. I hope this project can help others understand the U.S. economy and realty market as well, and I will try my best to explain the various concepts in an easy-to-understand way.

2. Guided Question

Project Task: This data analysis project will explore the following key questions to build a comprehensive understanding of the U.S. economy, with a particular focus on how these factors influence the realty market nationally and in Hawaii.

  1. Understanding Core Economic Concepts

                a. What is Inflation, and how does it impact the economy?

                b. What are U.S. Treasury Bonds and their Maturity Rates?

                c. How do Federal Interest Rates relate to Mortgage Rates?

  2. Exploring Key Real Estate Indicators

                a. How do Mortgage Application Rates, Realty Prices, Inventory Levels, and Days on Market assess the Health of the Real Estate Market?

                b. What does the Foreclosure Rate Reveal about Economic Stability?

  3. How do Inflation, Bond Yields, and Federal Interest Rates Together Influence the Real Estate Market?

  4. How Does the National Economy Compare to Hawaii’s Economy, and How Do These Differences Affect Hawaii’s Realty Market?

  5. How Can We Identify a Potential Bubble in the Realty Market?

  6. How Can Understanding These Factors Guide Future Real Estate Decisions?

3. Data Preparation

4. Process and Clean Data

4.1 Load the libraries used to work with this data.

4.2 Download the raw data

                Inventory source- Zillow

                Days on Market source- Zillow

                Mortgage Application source- Federal Financial Institutions Examination Council

                Mortgage Rate source- Federal Reserve Economic Data by St. Louis Fed

                Realty Prices source- Realtor.com

                Foreclosure Rate source- source 1, source 2

                Household Income source- Federal Reserve Economic Data by St. Louis Fed (Hawaii) and national data by FRED

                Inflation Rate source- Federal Reserve Economic Data by St. Louis Fed and source 2

                Bond Market Yield Rate source- Federal Reserve Economic Data by St. Louis Fed

4.3 Clean up column names for proper formatting

4.4 Check to make sure all numbers are formatted correctly.

4.5 Check for missing data

4.6 Delete any rows with NA values

4.7 Convert necessary data to numeric

4.8 Identify outliers in the raw data.


• High variability in inventory and pending: extreme values in these datasets warrant further investigation, potentially pointing to unusual market conditions or data issues.

• Historical patterns in drsfrmacbs, rcmflbactdpdpct90p, and dgs10: economic or market stress periods (e.g., 2009-2014 and the early 1980s) show significant outliers. These could reflect broader economic trends, such as the 2008 crisis and inflation-control policies.

• Kauai real estate prices: Kauai stands out in median_listing_price_hi with niche high-value properties. These should be contextualized against the broader market.

• Well-distributed data: some datasets (e.g., income_hi, cpilfesl, and median_listing_price_country) appear well distributed, with no significant anomalies.
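One common way to flag such extremes is the interquartile-range rule, sketched here on hypothetical inventory-style values; the 1.5 multiplier is the conventional Tukey fence and is an assumption for illustration, not necessarily the exact rule used in this project.

```r
# Flag values beyond 1.5 * IQR from the quartiles (Tukey's rule).
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  unname(x < q[1] - k * iqr | x > q[2] + k * iqr)
}

inventory <- c(900, 950, 1000, 1100, 1200, 5000)  # hypothetical; 5000 is a spike
flag_outliers(inventory)  # only the 5000 value is flagged
```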

4.9 Check the unique values of each column:

4.10 Check for leading and trailing white spaces:

4.11 Check and delete duplicate rows:
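The cleaning steps above can be sketched in base R on a small made-up data frame; the column names and values below are hypothetical, not from the project data.

```r
# Hypothetical raw data: a messy column name, stray whitespace, an NA row,
# and a duplicate row.
raw <- data.frame(
  `Median Price` = c(" 450000", "460000", NA, "460000"),
  County         = c("Honolulu ", "Maui", "Kauai", "Maui"),
  check.names = FALSE
)

# 4.3 clean column names into lower snake_case (as janitor::clean_names() would)
names(raw) <- gsub(" ", "_", tolower(names(raw)))

# 4.10 strip leading and trailing whitespace
raw[] <- lapply(raw, function(x) trimws(as.character(x)))

# 4.7 convert the price column to numeric
raw$median_price <- as.numeric(raw$median_price)

# 4.6 drop rows with NA, then 4.11 drop duplicate rows
clean <- unique(na.omit(raw))
```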

5. Analysis

What is Inflation?

Inflation is the gradual increase in the prices of goods and services over time. One primary cause of inflation is the devaluation of a currency, often resulting from excessive printing of fiat money. Another contributing factor is an imbalance between supply and demand—when demand for an item exceeds the supply, prices tend to rise. As inflation increases, consumers’ purchasing power is reduced because the same amount of money buys fewer goods and services. The opposite of inflation is deflation, which occurs when the prices of goods and services decrease, making items cheaper. This typically happens during a recession or when the economy is shrinking. In such situations, consumers tend to spend less, prompting sellers to reduce prices to encourage spending. While deflation may benefit consumers in the short term by allowing them to purchase more for less, it can have negative consequences for the economy. A shrinking economy often leads to reduced production, job losses, and lower overall economic growth.
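As a minimal illustration of how an inflation rate is computed from a price index, here is a year-over-year percent change on made-up index levels (these are not actual CPI values):

```r
# Year-over-year inflation: percent change of the index versus the prior year.
cpi <- c(270, 287, 300)                      # hypothetical annual index levels
inflation <- 100 * diff(cpi) / head(cpi, -1)
round(inflation, 1)                          # roughly 6.3% then 4.5%
```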

Food and energy have volatile prices, so to show a clearer trend with less noise, those items were excluded. The trend line is positive, and inflation does not appear to be slowing down; ever since 2021, inflation has been persistent.

5.1 Inflation exclude food and energy source

5.1.1 Inflation include food and energy source

What are Treasury Bonds?

Bonds are essentially loans from investors to entities such as governments or corporations, often described as IOUs. For example, 10-year Treasury bonds issued by the U.S. government have a fixed interest rate (coupon rate) that remains constant throughout the bond’s term. Investors, such as banks, often purchase these bonds as a way to earn a return on their cash while maintaining a relatively low-risk investment.

When interest rates and inflation interact, they significantly impact the bond market. If investors expect that the inflation rate over the bond’s lifetime will be higher than the bond’s fixed interest rate, the bond’s value decreases in real terms. In such cases, investors may choose to sell their bonds on the secondary market rather than holding them to maturity. To attract buyers in this scenario, the bonds are typically sold at a discounted price (below their face value), reflecting the reduced purchasing power of the fixed interest payments due to higher inflation.
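The discount mechanism can be made concrete with a present-value sketch. The numbers here describe a hypothetical 10-year bond, and the function is illustrative rather than the project's own code.

```r
# Price a bond by discounting its fixed coupons and face value at the
# prevailing market yield.
bond_price <- function(face, coupon_rate, market_yield, years) {
  coupons <- face * coupon_rate
  t <- seq_len(years)
  sum(coupons / (1 + market_yield)^t) + face / (1 + market_yield)^years
}

bond_price(1000, 0.03, 0.03, 10)  # yield equals coupon: price is par ($1,000)
bond_price(1000, 0.03, 0.05, 10)  # yields rise to 5%: price falls below par
```

When the required market yield rises above the fixed coupon, every future payment is worth less today, so the bond must sell at a discount to attract a buyer.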

5.6 Bond market source

How Mortgage Rates relate to the Bond Market?

Mortgage rates and the bond market are interconnected because both involve loans and interest rates. Mortgage rates are the interest borrowers pay on loans taken out to buy homes, often because they cannot pay the full price in cash. Banks, as profit-driven institutions, earn money through interest income from the mortgage loans they issue. However, banks also have other investment options, such as purchasing government bonds like the Treasury bonds discussed above.

Inflation plays a critical role in these decisions. If banks expect future inflation to be higher than current levels, they may seek to protect themselves from the erosion of purchasing power. For example, if the returns from bonds (due to their fixed interest) are anticipated to be lower than future inflation, banks may do the following to counteract potential losses:

• Sell these bonds in the secondary market at a discount, or

• Raise mortgage rates to earn higher returns and shield themselves from inflation’s impact on purchasing power.

5.4 Mortgage rate link

How the Federal Reserve Influences Interest Rates

The Federal Reserve (Fed) plays a central role in managing inflation and economic stability by adjusting the federal funds rate, which is the interest rate at which banks lend to one another overnight. When the Fed lowers this rate, it typically leads to a decrease in the interest rates banks charge on loans, including mortgages. Lower interest rates make borrowing less expensive, encouraging individuals and businesses to take out loans for activities such as starting businesses or buying homes. This increased economic activity can drive demand for homes, often outpacing supply and leading to rising home prices.

Conversely, when the Fed raises interest rates to curb inflation, borrowing becomes more expensive. Higher mortgage rates discourage buyers, reducing demand in the housing market. As a result, sellers may lower home prices to attract buyers, leading to a decline in housing prices.


As of December 2024, the Federal Reserve has lowered the federal funds rate by 50 basis points in September, followed by two 25 basis point cuts in November and December, bringing the current target range to 4.25% to 4.5%. Despite these reductions, mortgage rates remain elevated, with the average 30-year fixed rate around 7.00% as of late December. This persistence in high mortgage rates may indicate that market participants expect future inflation to remain elevated, influencing long-term Treasury yields and, consequently, mortgage rates.
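The arithmetic implied by those figures, in basis points (1 bp = 0.01 percentage point); the 5.50% starting upper bound is inferred from the 4.25%-4.50% ending range cited above:

```r
# 50 bp in September plus two 25 bp cuts: 100 bp in total, which takes the
# upper bound of the target range from 5.50% down to 4.50%.
cuts_bp <- c(sep = 50, nov = 25, dec = 25)
upper_after <- 5.50 - sum(cuts_bp) / 100
upper_after  # 4.5
```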

Identifying trends in realty data and charts

With a bit of macroeconomics under our belt, we now focus on realty data trends and charts.

Mortgage Application Rates

Mortgage application rates are a key indicator of the real estate market’s current status and potential future trends. As the name suggests, this data reflects the number of people seeking loans from banks to purchase homes. Before obtaining a loan, individuals must first get pre-approved, a process in which banks assess their creditworthiness based on factors such as credit score, income, debt-to-income ratio, and employment history. The following graph visually represents the number of people applying for mortgage approval over time. When application rates decrease, it suggests lower demand for homes, while an increase indicates higher demand.

The United States MBA Mortgage Market Index, as of December 27, 2024, provides a comprehensive overview of the nationwide mortgage market, covering all types of mortgage originators, including commercial banks, thrift institutions, and mortgage banking companies. The index currently stands at 174.9, which is relatively low compared to the past five years. Since October 2022, the index has hovered around 200, reflecting subdued mortgage activity. This suggests that fewer people are applying for mortgages compared to previous years.

Hawaii’s local mortgage data reflects similar trends to the national MBA Mortgage Market Index, albeit with unique regional dynamics. Mortgage applications in Hawaii peaked in 2021; by 2023, approximately 240,000 applications were recorded, a significant decline. Among Hawaii’s counties, Honolulu County has the highest number of applications, followed by Hawaii County, Maui County, and Kauai County.

Dwelling Types in Mortgage Applications

5.2.1 Mortgage Application source

The data also highlights trends in the types of dwellings associated with mortgage applications. Applications for multifamily manufactured homes have been nearly nonexistent since 2019. In contrast, applications for multifamily site-built homes increased to 140 in 2019 and remained steady until 2023, when they dropped significantly. Applications for single-family manufactured homes (1-4 units) decreased from 2018 to approximately 17 and remained at that level through 2023. Similarly, applications for single-family site-built homes (1-4 units) saw a steep decline, falling from a peak of 100,000 in 2021 to 40,000 in 2023.

Realty prices

The next indicator we will examine is realty prices, both nationally and locally. Our data, which dates back to 2016, reflects national averages. Since 2016, home prices have steadily increased, rising from an average of $250,000 to approximately $400,000. This trend indicates a consistent growth in overall home prices across the nation.

5.4.1 Realty Prices source

In Hawaii, well-known for having the highest median home prices in the nation, the increases have been particularly dramatic. From 2016 to 2020, median home prices were roughly four times higher than the national average. However, from 2020 to 2024, prices surged further, reaching six to seven times higher, with Kauai County experiencing the most significant price increase. Maui County followed as the second-largest price growth area, while Hawaii Island saw median home prices rise from $500,000 to $600,000. Interestingly, Honolulu County’s prices remained relatively stable during this period.

This significant growth in home prices, particularly in Hawaii, may be attributed to a combination of factors, including pandemic-related government policies, such as COVID-19 relief measures and eased monetary policies designed to stimulate the economy.

Here is a comparison between the median household income and home cost, particularly in Hawaii. As it shows, many people in Hawaii cannot afford homes there; the majority are priced out. For the rest of the country, here is a link to a map of the United States showing what is and is not affordable.

5.4.2 Price vs Household income

Indicators of a realty market bubble: when realty prices increase dramatically relative to the income of a location, this can indicate a market bubble, as the rapid price increases may be driven by speculative buying. Looking at Hawaii’s price-versus-income data, realty prices are extreme relative to the median income, which in Hawaii is around $100k and below. During the COVID years (2020 to 2022), prices skyrocketed, as mentioned, while income stayed relatively the same. Only at the beginning of 2024 did prices come down, and they have yet to come close to pre-COVID levels.
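A simple gauge for the affordability gap described above is the price-to-income ratio. The figures below are rough, hypothetical values in the ballpark of the text, not measured medians.

```r
# How many years of gross median income a median-priced home costs.
price_to_income <- function(median_price, median_income) {
  median_price / median_income
}

price_to_income(1000000, 100000)  # Hawaii-like: 10x median income
price_to_income(400000, 75000)    # national-like: about 5.3x
```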

5.4.3 Personal Savings

Here is the up-to-date information regarding personal savings in the US. The data can be found here; a more comprehensive dataset is found here, with data going all the way back from 2009 to the present day.

Inventory

This analysis examines inventory levels of homes nationally and locally in Hawaii. For buyers, higher inventory levels are favorable as they provide more options and increase competition among sellers, which can drive prices down. Historically, from 2017 to 2021, national inventory levels have been decreasing. This trend coincided with historically low interest rates driven by the Federal Reserve’s monetary policies, which reduced mortgage rates to as low as 2.5% in 2020. Graphically, this inverse relationship between mortgage rates and inventory levels is evident; as mortgage rates dropped, home-buying activity surged, reducing available inventory. By 2024, with rising mortgage rates, fewer people are purchasing homes, and fewer are applying for mortgages. This has contributed to a gradual increase in national inventory levels, rising from approximately 900,000 to 1.2 million homes.

In Hawaii, the inventory trends follow a similar pattern to the national curve but on a smaller scale. Notably, in early 2021, Hawaii’s inventory levels dropped sharply. This anomaly appears to be due to missing data for February to March 2021, primarily affecting Urban Honolulu, which accounts for the majority of the state’s housing inventory. Compared to outer island regions like Kapaa, Kahului, and Hilo, Urban Honolulu’s inventory is about twice as large. Despite this discrepancy, Hawaii’s overall inventory trends align closely with the national data, showing an increase in available homes in response to higher mortgage rates.

Here is data regarding the current inventory of realty properties in Hawaii and in the entire United States. Higher inventory is better for buyers: more choices, and prices go down.

5.1 Inventory source

5.1.2 Inventory US and Hawaii

Days on the Market (DOM)

DOM refers to the number of days a property remains listed for sale before it is sold. This metric often mirrors trends in the real estate market. Prior to COVID-19, properties in Hawaii typically stayed on the market for about 75 to 100 days before being sold. However, during the pandemic, when mortgage rates dropped to historic lows, this duration decreased significantly, with properties selling in less than a month. Homes were being purchased almost as soon as they were listed.

Similar to inventory trends, after 2023 and into 2024, the time properties spend on the market has been gradually increasing, approaching pre-pandemic levels. This shift reflects the cooling of the real estate market due to rising mortgage rates, which have reduced buyer activity. Nationally, the average number of pending days (time between listing and sale) was about 25% lower than in Hawaii during the same period. This discrepancy is likely due to Hawaii’s higher housing costs, which typically require more time for buyers to secure financing or make purchase decisions.


Foreclosure Rate

The next indicator we will examine is the foreclosure rate. The foreclosure rate represents the percentage (or fraction) of mortgage balances that are 90 or more days past due. As reported on the Federal Reserve Economic Data (FRED) platform, this metric reflects the proportion of total outstanding mortgage balances in delinquency or foreclosure.

Historically, foreclosure rates have been decreasing since 2012. However, in 2020, the rate increased from 0.5% to 2.8%, largely due to widespread job losses caused by COVID-19 restrictions. By the end of 2020, as mortgage rates dropped to historic lows, foreclosure rates began to decline again. By February 2023, the foreclosure rate had reached a record low of 0.3%.

How Foreclosure Rates Are Calculated

The foreclosure rate is calculated as follows:

• Numerator: The total dollar value of mortgage balances that are 90 or more days overdue (including those in foreclosure).

• Denominator: The total outstanding mortgage balances across all large banks.

For example, if the foreclosure rate is reported as 3%, it means that 3% of the total outstanding mortgage debt is delinquent or in foreclosure.
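The numerator/denominator definition above reduces to a one-line calculation; the dollar amounts here are invented for illustration.

```r
# Share of total outstanding mortgage balances that are 90+ days past due.
foreclosure_rate <- function(overdue_90p, total_outstanding) {
  100 * overdue_90p / total_outstanding
}

foreclosure_rate(30e9, 1000e9)  # $30B overdue out of $1T outstanding -> 3%
```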

The accompanying chart specifically focuses on single-family residential mortgages booked in domestic offices of U.S. commercial banks. It highlights delinquency rates, which include mortgages that are past due but not yet in foreclosure. As of Q2 2024, the delinquency rate stood at 1.73%, indicating the percentage of homeowners behind on their mortgage payments. This data is updated quarterly and seasonally adjusted.



5.4 Mortgage rate link


5.2 Mortgage application index

Here is data regarding mortgage applications throughout the USA. The graph shows the mortgage application index; the following link shows this graph. Here is another application link.

The composite index in the context of mortgage applications represents a combined measure that tracks the overall volume of mortgage loan applications. It aggregates data from different types of loan applications, such as those for refinancing and purchasing homes, to give a comprehensive view of mortgage application activity.

This data can show us whether people want to buy homes, and based on the mortgage application index data, fewer people are ready to buy homes than pre-COVID. The Fed’s reduction of 50 basis points has had little or no effect in changing people’s minds about buying homes.

5.2.1 source

The most popular dwelling type is the site-built single-family house (1-4 units), which receives the most applications; second is manufactured homes, at roughly a thousand times fewer.

5.3 Foreclosure rate


5.4.1 Realty Prices source

Prices of homes are too high.

5.4.2 Price vs Household income


5.6 Bond market source


5.1 Heat Map

5.2 Bar Graph

5.3 Bar Graph multiple dates

5.4 Geographic Map

6. Conclusion

  1. Implications for Future Decision-Making
  2. How Can Understanding These Factors Guide Future Real Estate Decisions? Summarize how knowledge of economic and real estate indicators can inform investment decisions, home purchasing, and policy-making. Discuss how anticipating changes in these factors might support better decision-making for buyers, investors, and policymakers in both the U.S. and Hawaii.

Based on key indicators, including mortgage application rates, realty prices, inventory levels, and days on the market, the real estate market as of January 2, 2025, appears to be in a downturn. Fewer people are applying for mortgage loans, reflecting a significant decline in buyer activity and demand. Home prices remain high or continue to rise, creating affordability challenges that further reduce purchasing activity. At the same time, inventory levels have increased, indicating more homes on the market relative to the number of active buyers. Additionally, properties are taking longer to sell, with the market gradually returning to pre-pandemic conditions.

7. Appendix

knitr::opts_chunk$set(echo = TRUE,warning = FALSE)

# 4.1 Library used installations:

# install.packages("tidytuesdayR")
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("janitor")
# install.packages("skimr")
# install.packages("lubridate")
# install.packages("ggpubr")
# install.packages("data.table")
# install.packages("viridis")
# install.packages("leaflet")
# install.packages("htmlwidgets")
# install.packages("htmltools")
# install.packages("kableExtra")
# install.packages("here")
# install.packages("visdat")
# install.packages("readxl")
# install.packages("tidyr")
# install.packages("scales")
# install.packages("plotly")
# install.packages("forecast")
# install.packages("gridExtra")
# install.packages("formattable")
#install.packages("ggbreak")
#install.packages("naniar")



library("tidytuesdayR")
library("tidyverse")
library("dplyr")
library("janitor")
library("skimr")
library("lubridate")
library("ggpubr")
library("data.table")
library("viridis")
library("leaflet")
library("htmlwidgets")
library("htmltools")
library("kableExtra")
library("here")
library("ggrepel")
library("ggbreak")
library("naniar")
library("ggplot2")
library("plotly")

#4.2 download Raw data

# options(timeout=300)   # used to prevent a timeout if the download takes too long
# # 4.2 download Raw data
# url <- "https://files.zillowstatic.com/research/public_csvs/invt_fs/Metro_invt_fs_uc_sfrcondo_month.csv?t=1730744841"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# inventory_data <- read.csv("temp_data.csv")     # read the downloaded CSV file



# Load the CSV file from the 'R_Econ_Data' folder
inventory_data  <- read.csv(here("R_Econ_Data", "Metro_invt_fs_uc_sfrcondo_month.csv"))


# Reshape from wide to long format
long_inventory <- inventory_data %>%
  pivot_longer(
    cols = starts_with("X"),     # Adjust to match the column names with 'X' prefix
    names_to = "Date",            # Create a new 'Date' column
    values_to = "Inventory"       # Move values into 'Inventory' column
  )

# Optional: Remove the 'X' from the 'Date' column for better formatting
long_inventory$Date <- gsub("^X", "", long_inventory$Date)  # Remove leading 'X'
long_inventory$Date <- as.Date(long_inventory$Date, format = "%Y.%m.%d")  # Convert to Date format



#4.2 download Raw data days pending


options(timeout=300)   # used to prevent a timeout if the download takes too long
# 4.2 download Raw data

# url <- "https://files.zillowstatic.com/research/public_csvs/mean_doz_pending/Metro_mean_doz_pending_uc_sfrcondo_month.csv?t=1730744841"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# days_pnd_data <- read.csv("temp_data.csv")     # read the downloaded CSV file


days_pnd_data <- read.csv(here("R_Econ_Data", "Metro_mean_doz_pending_uc_sfrcondo_month.csv"))


# Reshape from wide to long format
long_pending <- days_pnd_data %>%
  pivot_longer(
    cols = starts_with("X"),  # Matches the date columns, which carry an 'X' prefix
    names_to = "Date",           # Create a new 'Date' column
    values_to = "Pending"        # Move values into 'Pending' column
  )

# Optional: Remove the 'X' from the 'Date' column for better formatting
long_pending$Date <- gsub("^X", "", long_pending$Date)  # Remove leading 'X'
long_pending$Date <- as.Date(long_pending$Date, format = "%Y.%m.%d")  # Zillow date columns are YYYY.MM.DD



#options(timeout=300) 
# url <- "https://cfpb-hmda-public.s3.amazonaws.com/prod/data-browser/2023/filtered-queries/snapshot/83bbcb7a28f2a289eca9ec4043afca38.csv"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# mortapp23_data <- read.csv("temp_data.csv")     # read the downloaded CSV file
# 
# url <- "https://cfpb-hmda-public.s3.amazonaws.com/prod/data-browser/2022/filtered-queries/one-year/83bbcb7a28f2a289eca9ec4043afca38.csv"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# mortapp22_data <- read.csv("temp_data.csv")     # read the downloaded CSV file
# 
# url <- "https://cfpb-hmda-public.s3.amazonaws.com/prod/data-browser/2021/filtered-queries/one-year/83bbcb7a28f2a289eca9ec4043afca38.csv"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# mortapp21_data <- read.csv("temp_data.csv")     # read the downloaded CSV file
# 
# url <- "https://cfpb-hmda-public.s3.amazonaws.com/prod/data-browser/2020/filtered-queries/three-year/83bbcb7a28f2a289eca9ec4043afca38.csv"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# mortapp20_data <- read.csv("temp_data.csv")     # read the downloaded CSV file
# 
# url <- "https://cfpb-hmda-public.s3.amazonaws.com/prod/data-browser/2019/filtered-queries/three-year/83bbcb7a28f2a289eca9ec4043afca38.csv"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# mortapp19_data <- read.csv("temp_data.csv")     # read the downloaded CSV file
# 
# url <- "https://cfpb-hmda-public.s3.amazonaws.com/prod/data-browser/2018/filtered-queries/three-year/83bbcb7a28f2a289eca9ec4043afca38.csv"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# mortapp18_data <- read.csv("temp_data.csv")     # read the downloaded CSV file

# Load the CSV files from the 'R_Econ_Data' folder
mortapp23_data <- read.csv(here("R_Econ_Data", "state_HI_2023.csv"))
mortapp22_data <- read.csv(here("R_Econ_Data", "state_HI_2022.csv"))
mortapp21_data <- read.csv(here("R_Econ_Data", "state_HI_2021.csv"))
mortapp20_data <- read.csv(here("R_Econ_Data", "state_HI_2020.csv"))
mortapp19_data <- read.csv(here("R_Econ_Data", "state_HI_2019.csv"))
mortapp18_data <- read.csv(here("R_Econ_Data", "state_HI_2018.csv"))

# Combine all 6 yearly files into one data set
mortapp_hi <- rbind(mortapp23_data, mortapp22_data, mortapp21_data, mortapp20_data, mortapp19_data, mortapp18_data)

rm(mortapp23_data,mortapp22_data,mortapp21_data,mortapp20_data,mortapp19_data,mortapp18_data)

#options(timeout=300) 
# url <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1140&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=OBMMIC30YFLVGT80FB720A739&scale=left&cosd=2017-01-03&coed=2024-11-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=3&ost=-99999&oet=99999&mma=0&fml=a&fq=Daily&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-11-04&revision_date=2024-11-04&nd=2017-01-03"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# df4<- read.csv("temp_data.csv")     # read the downloaded CSV file


mort_rate_data <- read.csv(here("R_Econ_Data", "OBMMIC30YFLVGT80FB720A739.csv"))

#head(mort_rate_data)



# options(timeout=300) 
# url <- "https://econdata.s3-us-west-2.amazonaws.com/Reports/Core/RDC_Inventory_Core_Metrics_Country_History.csv"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# price_Nat__data  <- read.csv("temp_data.csv")     # read the downloaded CSV file
# 
# url <- "https://econdata.s3-us-west-2.amazonaws.com/Reports/Core/RDC_Inventory_Core_Metrics_Metro_History.csv"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# price_Hi__data <- read.csv("temp_data.csv")     # read the downloaded CSV file



price_Nat__data <- read.csv(here("R_Econ_Data", "RDC_Inventory_Core_Metrics_Country_History.csv"))
#median listing price, active listing count, days in market, new listing count, price increase count, 
#price decrease count, median listing price sqft, ave listing price,total listing count,

price_Hi__data <- read.csv(here("R_Econ_Data", "RDC_Inventory_Core_Metrics_County_History.csv"))



# Convert 'month_date_yyyymm' to Date format
price_Nat__data$month_date_yyyymm <- as.Date(paste0(substr(price_Nat__data$month_date_yyyymm, 1, 4), "-",
                                                    substr(price_Nat__data$month_date_yyyymm, 5, 6), "-01"))

price_Hi__data$month_date_yyyymm <- as.Date(paste0(substr(price_Hi__data$month_date_yyyymm, 1, 4), "-",
                                                   substr(price_Hi__data$month_date_yyyymm, 5, 6), "-01"))


# Filter for Hawaii counties in price_Hi__data by identifying 'hi' in county_name
price_Hi_filtered <- price_Hi__data %>%
  filter(grepl(", hi", county_name, ignore.case = TRUE))

# Select relevant columns for merging
price_Nat_selected <- price_Nat__data %>%
  select(month_date_yyyymm, median_listing_price)

price_Hi_selected <- price_Hi_filtered %>%
  select(month_date_yyyymm, county_name, median_listing_price)

# Merge data on 'month_date_yyyymm' to combine country and Hawaii data
combined_data_price <- merge(price_Nat_selected, price_Hi_selected, by = "month_date_yyyymm", 
                       suffixes = c("_country", "_hi"))

# View the combined data
#print(combined_data)

rm(price_Hi_filtered ,price_Nat_selected,price_Nat__data , price_Hi__data  )

#write.csv(combined_data, "combined_data.csv", row.names = FALSE)

# options(timeout=300) 
# url <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1140&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=DRSFRMACBS&scale=left&cosd=1991-01-01&coed=2024-07-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=3&ost=-99999&oet=99999&mma=0&fml=a&fq=Quarterly%2C%20End%20of%20Period&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-11-19&revision_date=2024-11-19&nd=1991-01-01"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# foreclosure_data <- read.csv("temp_data.csv")     # read the downloaded CSV file

foreclosure_data <- read.csv(here("R_Econ_Data", "DRSFRMACBS.csv"))


# options(timeout=300) 
# url <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1140&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=RCMFLBACTDPDPCT90P&scale=left&cosd=2012-07-01&coed=2024-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Quarterly%2C%20End%20of%20Quarter&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-11-19&revision_date=2024-11-19&nd=2012-07-01"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# foreclosure90_data <- read.csv("temp_data.csv")     # read the downloaded CSV file

foreclosure90_data <- read.csv(here("R_Econ_Data", "RCMFLBACTDPDPCT90P.csv"))



# options(timeout=300) 
# url <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1140&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=MEHOINUSHIA646N&scale=left&cosd=1984-01-01&coed=2023-01-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Annual&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-11-04&revision_date=2024-11-04&nd=1984-01-01"
#  download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# hhi_income_data <- read.csv("temp_data.csv")     # read the downloaded CSV file


# Household income, Hawaii
hhi_income_data <- read.csv(here("R_Econ_Data", "MEHOINUSHIA646N.csv"))

# Rename the FRED series column to a descriptive name
hhi_income_data <- hhi_income_data %>%
  rename(income_hi = MEHOINUSHIA646N)

# options(timeout=300) 
# url <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1140&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=MEHOINUSA646N&scale=left&cosd=1984-01-01&coed=2023-01-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Annual&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-11-04&revision_date=2024-11-04&nd=1984-01-01"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# husa_income_data <- read.csv("temp_data.csv")     # read the downloaded CSV file


# Household income, national
husa_income_data <- read.csv(here("R_Econ_Data", "MEHOINUSA646N.csv"))


# options(timeout=300) 
# url <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1140&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CPILFESL&scale=left&cosd=1957-01-01&coed=2024-09-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-11-04&revision_date=2024-11-04&nd=1957-01-01"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# infl_rate_data  <- read.csv("temp_data.csv")     # read the downloaded CSV file



infl_rate_data <- read.csv(here("R_Econ_Data", "CPILFESL.csv"))


# options(timeout=300) 
# url <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1140&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CPIAUCSL&scale=left&cosd=1947-01-01&coed=2024-12-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=3&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2025-01-15&revision_date=2025-01-15&nd=1947-01-01"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# infl_rate_data_2  <- read.csv("temp_data.csv")     # read the downloaded CSV file

infl_rate_data_2 <-read.csv(here("R_Econ_Data", "CPIAUCSL.csv"))




# options(timeout=300) 
# url <- "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1140&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=DGS10&scale=left&cosd=2019-10-31&coed=2024-10-31&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=3&ost=-99999&oet=99999&mma=0&fml=a&fq=Daily&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-11-04&revision_date=2024-11-04&nd=1962-01-02"
# download.file(url, "temp_data.csv")  # save as CSV, not ZIP
# bond_mkty_rate_data <- read.csv("temp_data.csv")     # read the downloaded CSV file


bond_mkty_rate_data <- read.csv(here("R_Econ_Data", "DGS10.csv"))


homevalue <- read.csv(here("R_Econ_Data", "Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_month.csv"))

# Reshape from wide to long format
L_homevalue <- homevalue %>%
  pivot_longer(
    cols = starts_with("X"),        # Date columns are prefixed with 'X' by read.csv
    names_to = "Date",              # Column names become a new 'Date' column
    values_to = "Value_dollers"     # Cell values become the home-value column
  )

# Remove the leading 'X' from the 'Date' column and convert to Date format
L_homevalue$Date <- gsub("^X", "", L_homevalue$Date)
L_homevalue$Date <- as.Date(L_homevalue$Date, format = "%Y.%m.%d")

# Create a new column with Month-Year only for display
L_homevalue$Month_Year <- format(L_homevalue$Date, "%m-%Y")  # Format as "Month-Year"

rm(homevalue)
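Zillow's wide files carry the dates in the column names, which read.csv prefixes with an "X"; the gsub()/as.Date() pair above strips that prefix and parses the dot-separated date. A one-line sketch with a hypothetical column name:

```r
# Sketch of the column-name-to-Date conversion used above
raw_name <- "X2020.01.31"                               # hypothetical Zillow column name
as.Date(gsub("^X", "", raw_name), format = "%Y.%m.%d")  # a proper Date: 2020-01-31
```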


# Rental value
rental <- read.csv(here("R_Econ_Data", "Metro_zori_uc_sfrcondomfr_sm_month.csv"))

# Reshape from wide to long format
L_rental <- rental %>%
  pivot_longer(
    cols = starts_with("X"),        # Date columns are prefixed with 'X' by read.csv
    names_to = "Date",              # Column names become a new 'Date' column
    values_to = "rent_dollers"      # Cell values become the rent column
  )

# Remove the leading 'X' from the 'Date' column and convert to Date format
L_rental$Date <- gsub("^X", "", L_rental$Date)
L_rental$Date <- as.Date(L_rental$Date, format = "%Y.%m.%d")

# Create a new column with Month-Year only for display
L_rental$Month_Year <- format(L_rental$Date, "%m-%Y")  # Format as "Month-Year"

rm(rental)




# Filter for home values
L_homevalue_f <- L_homevalue %>%
  filter(grepl("United States", RegionName, ignore.case = TRUE))

L_rental_f <- L_rental %>%
  filter(grepl("United States", RegionName, ignore.case = TRUE))

# Select relevant columns for merging
L_rental_f <- L_rental_f %>%
  select(Month_Year, RegionName, rent_dollers)

L_homevalue_f<- L_homevalue_f %>%
  select(Month_Year, RegionName, Value_dollers)

# Merge on 'Month_Year' to combine rental and home-value data
cap_data <- merge(L_rental_f, L_homevalue_f, by = "Month_Year",
                  suffixes = c("_rental", "_homevalue"))

# Estimate monthly operating expenses as 40% of rent (an assumed expense ratio)
cap_data <- cap_data %>%
  mutate(monthly_exp = rent_dollers * 0.40)

# Monthly cap rate: net monthly income divided by home value (x12 to annualize)
cap_data <- cap_data %>%
  mutate(cap = (rent_dollers - monthly_exp) / Value_dollers)

# Convert Month_Year ("mm-YYYY") back to a Date by appending a day
cap_data <- cap_data %>%
  mutate(Month_Year = as.Date(paste0(Month_Year, "-01"), format = "%m-%Y-%d"))

rm(L_homevalue_f, L_rental_f)
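To sanity-check the cap-rate arithmetic above, here is a worked example with made-up numbers; the 40% expense ratio is the same assumption used in the code:

```r
# Worked cap-rate example (toy numbers)
rent  <- 2000                      # monthly rent, dollars
value <- 600000                    # home value, dollars
exp_m <- rent * 0.40               # assumed 40% expense ratio -> 800
cap_m <- (rent - exp_m) / value    # monthly cap: 1200 / 600000 = 0.002
cap_m * 12                         # annualized: 0.024, i.e. about 2.4%
```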












# 4.3 Standardize all column names with janitor::clean_names()
long_inventory_cl <- clean_names(long_inventory)
long_pending_cl <- clean_names(long_pending)
mortapp_hi_cl <- clean_names(mortapp_hi)
mort_rate_data_cl <- clean_names(mort_rate_data)

#delete later
combined_data_price_cl <- clean_names(combined_data_price)

foreclosure_data_cl <- clean_names(foreclosure_data)
foreclosure90_data_cl <- clean_names(foreclosure90_data)
hhi_income_data_cl <- clean_names(hhi_income_data)
husa_income_data_cl <- clean_names(husa_income_data)
infl_rate_data_cl <- clean_names(infl_rate_data)
bond_mkty_rate_data_cl <- clean_names(bond_mkty_rate_data)


#rm(long_inventory,long_pending,mortapp_hi,mort_rate_data,foreclosure_data,foreclosure90_data,hhi_income_data,husa_income_data,infl_rate_data,bond_mkty_rate_data)


# 4.5 Check whether every column is numeric:
sapply(long_inventory_cl ,is.numeric)
sapply(long_pending_cl ,is.numeric)
sapply(mortapp_hi_cl ,is.numeric)

# mort_rate_data_cl contains a non-numeric column with NA/blank values; the NA rows
# are deleted first, then the column is converted to numeric (see below)
sapply(mort_rate_data_cl, is.numeric)

sapply(combined_data_price_cl,is.numeric)
sapply(foreclosure_data_cl ,is.numeric)
sapply(foreclosure90_data_cl ,is.numeric)
sapply(hhi_income_data_cl ,is.numeric)
sapply(husa_income_data_cl ,is.numeric)
sapply(infl_rate_data_cl ,is.numeric)
sapply(bond_mkty_rate_data_cl ,is.numeric)


# Function to visualize missingness and compute summary stats
visualize_missingness_all <- function(dataframes_list) {
  plots <- list() # Initialize a list to store plots
  
  # Iterate over each dataframe in the list
  for (name in names(dataframes_list)) {
    data <- dataframes_list[[name]]
    
    # Visualize missing data patterns
      plot <- vis_miss(data) +
      ggtitle(paste("Missing Data Visualization for", name)) +
      theme_minimal() +
      theme(axis.text.x = element_text(angle = 45, hjust = 0)) # Slant x-axis labels
    print(plot) # Display each plot
    
    # Calculate missing percentages
    missing_percents <- sapply(data, function(x) mean(is.na(x))) * 100
    
    # Create a data frame for display
    missing_df <- data.frame(
      Column = names(missing_percents),
      MissingPercentage = round(missing_percents, 2)
    )
    
    # Display the table
    print(
      kable(missing_df, caption = paste("Percentage of Missing Values by Column -", name)) %>%
        kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
    )
    
    # Save the plot in the list
    plots[[name]] <- plot
  }
  
  return(plots) # Return the list of plots
}
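The percentage calculation inside the function is just the mean of an is.na() indicator scaled by 100. A tiny self-contained check on toy data:

```r
# mean(is.na(x)) * 100 gives the percent of missing cells in a column
toy <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"))
sapply(toy, function(x) mean(is.na(x))) * 100  # a = 50, b = 25
```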

# Define the list of dataframes
dataframes_econ <- list(
  df1 = long_inventory_cl,
  df2 = long_pending_cl,
  #df3 = mortapp_hi_cl,
  df4 = mort_rate_data_cl,
  df5 = combined_data_price_cl,
  df6 = foreclosure_data_cl,
  df7 = foreclosure90_data_cl,
  df8 = hhi_income_data_cl,
  df9 = husa_income_data_cl,
  df10 = infl_rate_data_cl,
  df11 = bond_mkty_rate_data_cl
)

# Call the function and generate visualizations for all dataframes
plots <- visualize_missingness_all(dataframes_econ)

# Optional: Save plots to files
# lapply(names(plots), function(name) {
#   ggsave(filename = paste0(name, "_missingness_plot.png"), plot = plots[[name]])
# })
# 4.6 Delete any rows with NA in the data
# Combine dataframes into a named list
dataframes_econ <- list(
  df1 = long_inventory_cl,
  df2 = long_pending_cl,
# mortapp_hi_cl is skipped: it contains many NAs because of how it is formatted
  #df3 = mortapp_hi_cl,
  df4 = mort_rate_data_cl,
  df5 = combined_data_price_cl,
  df6 = foreclosure_data_cl,
  df7 = foreclosure90_data_cl,
  df8 = hhi_income_data_cl,
  df9 = husa_income_data_cl,
  df10 = infl_rate_data_cl,
  df11 = bond_mkty_rate_data_cl
)

# Apply na.omit to each dataframe in the list
Cleaned_dataframes_cl_econ <- lapply(dataframes_econ, na.omit)

# Access cleaned dataframes by name
long_inventory_cl <- Cleaned_dataframes_cl_econ$df1
long_pending_cl <- Cleaned_dataframes_cl_econ$df2
#mortapp_hi_cl <- Cleaned_dataframes_cl_econ$df3
mort_rate_data_cl <- Cleaned_dataframes_cl_econ$df4
combined_data_price_cl <- Cleaned_dataframes_cl_econ$df5
foreclosure_data_cl <- Cleaned_dataframes_cl_econ$df6
foreclosure90_data_cl <- Cleaned_dataframes_cl_econ$df7
hhi_income_data_cl <- Cleaned_dataframes_cl_econ$df8
husa_income_data_cl <- Cleaned_dataframes_cl_econ$df9
infl_rate_data_cl <- Cleaned_dataframes_cl_econ$df10
bond_mkty_rate_data_cl <- Cleaned_dataframes_cl_econ$df11


# mort_rate_data_cl was found to be non-numeric; now that the NA rows are removed,
# the rate column can be converted:
mort_rate_data_cl$obmmic30yflvgt80fb720a739 <- as.numeric(mort_rate_data_cl$obmmic30yflvgt80fb720a739)


bond_mkty_rate_data_cl$dgs10 <- as.numeric(bond_mkty_rate_data_cl$dgs10)


# Define the list of dataframes
dataframes_econ <- list(
  df1 = long_inventory_cl,
  df2 = long_pending_cl,
  df4 = mort_rate_data_cl,
  df5 = combined_data_price_cl,
  df6 = foreclosure_data_cl,
  df7 = foreclosure90_data_cl,
  df8 = hhi_income_data_cl,
  df9 = husa_income_data_cl,
  df10 = infl_rate_data_cl,
  df11 = bond_mkty_rate_data_cl
)

# Define focus columns (columns to check for outliers)
focus_columns <- list(
  df1 = "inventory",
  df2 = "pending",
  df4 = "obmmic30yflvgt80fb720a739",
  df5 = c("median_listing_price_hi", "median_listing_price_country"),
  df6 = "drsfrmacbs",
  df7 = "rcmflbactdpdpct90p",
  df8 = "income_hi",
  df9 = "mehoinusa646n",
  df10 = "cpilfesl",
  df11 = "dgs10"
)

# Define x-axis columns (columns to be used as x-axis)
x_axis_columns <- list(
  df1 = "date",
  df2 = "date",
  df4 = "date",
  df5 = "month_date_yyyymm",
  df6 = "date",
  df7 = "date",
  df8 = "date",
  df9 = "date",
  df10 = "date",
  df11 = "date"
)

# Function to detect outliers in a single dataframe
detect_outliers <- function(df, focus_columns) {
  results <- list()
  
  for (column in focus_columns) {
    # Ensure the column exists
    if (!column %in% colnames(df)) {
      results[[column]] <- paste("Column", column, "not found in dataframe")
      next
    }
    
    # Ensure the column is numeric
    if (!is.numeric(df[[column]])) {
      results[[column]] <- paste("Column", column, "is not numeric")
      next
    }
    
    # Calculate quartiles and IQR
    Q1 <- quantile(df[[column]], 0.25, na.rm = TRUE)
    Q3 <- quantile(df[[column]], 0.75, na.rm = TRUE)
    IQR <- Q3 - Q1
    
    # Calculate lower and upper bounds
    lower_bound <- Q1 - 1.5 * IQR
    upper_bound <- Q3 + 1.5 * IQR
    
    # Identify outliers
    outliers <- which(df[[column]] < lower_bound | df[[column]] > upper_bound)
    outlier_values <- df[outliers, ]
    
    # Save results for the column
    results[[column]] <- list(
      num_outliers = length(outliers),
      outliers = outlier_values,
      lower_bound = lower_bound,
      upper_bound = upper_bound
    )
  }
  
  return(results)
}
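The fences in detect_outliers() follow the standard 1.5 * IQR rule. A self-contained toy run of the same arithmetic, where only the one extreme value falls outside the fences:

```r
# IQR outlier rule on toy data: only the extreme value exceeds the upper fence
x  <- c(10, 11, 12, 11, 10, 11, 100)
Q1 <- quantile(x, 0.25)                  # 10.5
Q3 <- quantile(x, 0.75)                  # 11.5
fence <- 1.5 * (Q3 - Q1)                 # 1.5
which(x < Q1 - fence | x > Q3 + fence)   # index 7 (the value 100)
```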

# Apply the function to each dataframe using the specified focus columns
outlier_results <- mapply(
  FUN = detect_outliers,
  df = dataframes_econ,
  focus_columns = focus_columns,
  SIMPLIFY = FALSE
)

# Display the results for each dataframe (optional)
for (i in seq_along(outlier_results)) {
  cat("Results for", names(dataframes_econ)[i], ":\n")
  print(outlier_results[[i]])
  cat("\n")
}

# Function to plot data with outliers highlighted and cleaned x-axis
plot_outliers <- function(data, x_col, y_col, lower_bound, upper_bound, df_name, max_labels = 10) {
  # Ensure both x_col and y_col exist in the dataframe
  if (!x_col %in% colnames(data)) {
    message(paste("X-axis column", x_col, "not found in", df_name))
    return(NULL)
  }
  if (!y_col %in% colnames(data)) {
    message(paste("Y-axis column", y_col, "not found in", df_name))
    return(NULL)
  }
  
  # Adjust x-axis tick labels
  x_breaks <- if (is.numeric(data[[x_col]])) {
    # For numeric x-axis
    pretty(data[[x_col]], n = max_labels)
  } else if (inherits(data[[x_col]], "Date")) {
    # For date x-axis
    seq(min(data[[x_col]], na.rm = TRUE), max(data[[x_col]], na.rm = TRUE), length.out = max_labels)
  } else {
    # For categorical x-axis, take unique levels
    unique(data[[x_col]])[seq(1, length(unique(data[[x_col]])), length.out = max_labels)]
  }
  
  # Create the plot (using the .data pronoun; aes_string() is deprecated in ggplot2)
  p <- ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point(aes(color = (.data[[y_col]] < lower_bound | .data[[y_col]] > upper_bound))) +
    geom_hline(yintercept = lower_bound, linetype = "dashed", color = "blue") +
    geom_hline(yintercept = upper_bound, linetype = "dashed", color = "blue") +
    labs(
      title = paste("Outlier Visualization for", y_col, "in", df_name),
      x = x_col,
      y = y_col
    ) +
    scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red"), guide = "none") +
    theme_minimal() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8)  # Adjust label rotation and size
    )
  
  # Add adjusted breaks for the x-axis
  if (is.numeric(data[[x_col]])) {
    p <- p + scale_x_continuous(breaks = x_breaks)
  } else if (inherits(data[[x_col]], "Date")) {
    p <- p + scale_x_date(breaks = x_breaks, date_labels = "%b %Y")  # Customize date format
  } else {
    p <- p + scale_x_discrete(breaks = x_breaks)
  }
  
  # Display the plot
  print(p)
}

# Iterate over dataframes and plot outliers
for (df_name in names(dataframes_econ)) {
  df <- dataframes_econ[[df_name]]
  focus_col <- focus_columns[[df_name]]
  x_col <- x_axis_columns[[df_name]] # Get x-axis column for this dataframe
  
  if (is.character(focus_col)) {
    for (col in focus_col) {
      # Skip if bounds are not available
      if (is.null(outlier_results[[df_name]][[col]])) next
      
      # Extract bounds
      lower_bound <- outlier_results[[df_name]][[col]]$lower_bound
      upper_bound <- outlier_results[[df_name]][[col]]$upper_bound
      
      # For specific dataframes, reduce x-axis clutter
      if (df_name %in% c("df4", "df6", "df10", "df11")) {
        plot_outliers(df, x_col, col, lower_bound, upper_bound, df_name, max_labels = 30)
      } else {
        plot_outliers(df, x_col, col, lower_bound, upper_bound, df_name)
      }
    }
  }
}
# 4.7 Check character columns for leading/trailing whitespace

check_and_flag_whitespace_all <- function(dataframes_list) {
  results <- list()  # Initialize a list to store the results for each dataframe
  flagged_rows_list <- list()  # List to store flagged rows for each dataframe and column
  
  # Iterate over each dataframe in the list
  for (name in names(dataframes_list)) {
    df <- dataframes_list[[name]]  # Get the current dataframe
    
    # Initialize a vector to store the sum of flagged cells for each column
    whitespace_counts <- integer(length = ncol(df))
    
    # Loop through each column in the dataframe
    for (i in seq_along(df)) {
      # Check if the column is of type character
      if (is.character(df[[i]])) {
        # Flag cells with leading or trailing whitespace
        flagged_cells <- grepl("^\\s+|\\s+$", df[[i]])
        # Sum the flagged cells and store the result
        whitespace_counts[i] <- sum(flagged_cells)
        
        # If flagged cells exist, extract flagged rows for the current column
        if (sum(flagged_cells) > 0) {
          flagged_rows <- df[flagged_cells, ]
          flagged_rows_list[[paste(name, names(df)[i], sep = "_")]] <- flagged_rows
        }
      }
    }
    
    # Create a named vector with the counts of whitespace per column
    whitespace_summary <- setNames(whitespace_counts[whitespace_counts != 0], names(df)[whitespace_counts != 0])
    
    # Print the summary for the current dataframe
    cat("\nResults for dataframe:", name, "\n")
    print(whitespace_summary)
    
    # Store the summary in the results list
    results[[name]] <- whitespace_summary
  }
  
  return(list(Summaries = results, FlaggedRows = flagged_rows_list))  # Return both summaries and flagged rows
}
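The regex `^\s+|\s+$` used by the function flags leading or trailing whitespace only; interior spaces are left alone. A quick self-contained check:

```r
# Flag cells with leading or trailing whitespace (interior spaces are fine)
vals <- c(" maui", "oahu ", "kauai", "big island")
grepl("^\\s+|\\s+$", vals)       # TRUE TRUE FALSE FALSE
sum(grepl("^\\s+|\\s+$", vals))  # 2 flagged cells
```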

# Build the named list of dataframes to check for whitespace
dataframes_econ <- list(
  df1 = long_inventory_cl,
  df2 = long_pending_cl,
  df3 = mortapp_hi_cl,
  df4 = mort_rate_data_cl,
  df5 = combined_data_price_cl,
  df6 = foreclosure_data_cl,
  df7 = foreclosure90_data_cl,
  df8 = hhi_income_data_cl,
  df9 = husa_income_data_cl,
  df10 = infl_rate_data_cl,
  df11 = bond_mkty_rate_data_cl
)

# Call the function and check for whitespace across all dataframes
results <- check_and_flag_whitespace_all(dataframes_econ)

# Access summaries of whitespace counts
print(results$Summaries)

# Access flagged rows for inspection
print(results$FlaggedRows)

# 
# # Function to check and handle duplicate rows across multiple dataframes
# check_and_handle_duplicates_all <- function(dataframes_list, remove_duplicates = FALSE) {
#   results <- list()  # Initialize a list to store duplicate row summaries
#   cleaned_dataframes <- list()  # Initialize a list to store cleaned dataframes (if duplicates are removed)
#   
#   # Iterate over each dataframe in the list
#   for (name in names(dataframes_list)) {
#     df <- dataframes_list[[name]]  # Get the current dataframe
#     
#     # Check for duplicate rows
#     duplicated_rows <- duplicated(df)  # Logical vector indicating duplicate rows
#     num_duplicates <- sum(duplicated_rows)  # Count the number of duplicate rows
#     
#     # Print duplicate information for the current dataframe
#     cat("\nResults for dataframe:", name, "\n")
#     cat("Number of duplicate rows:", num_duplicates, "\n")
#     
#     # Store duplicate summary in results
#     results[[name]] <- list(
#       TotalRows = nrow(df),
#       DuplicateRows = num_duplicates,
#       PercentageDuplicates = round(100 * num_duplicates / nrow(df), 2)
#     )
#     
#     # Remove duplicates if specified
#     if (remove_duplicates) {
#       cleaned_df <- df[!duplicated_rows, ]  # Keep only unique rows
#       cleaned_dataframes[[name]] <- cleaned_df  # Save the cleaned dataframe
#     } else {
#       cleaned_dataframes[[name]] <- df  # Save the original dataframe unchanged
#     }
#   }
#   
#   # Return the results and optionally cleaned dataframes
#   return(list(DuplicateSummaries = results, CleanedDataframes = cleaned_dataframes))
# }
# 
# # Example usage with your dataframes
# dataframes_econ <- list(
#   df1 = long_inventory_cl,
#   df2 = long_pending_cl,
#   df3 = mortapp_hi_cl, # Uncomment if available
#   df4 = mort_rate_data_cl,
#   df5 = combined_data_price_cl,
#   df6 = foreclosure_data_cl,
#   df7 = foreclosure90_data_cl,
#   df8 = hhi_income_data_cl,
#   df9 = husa_income_data_cl,
#   df10 = infl_rate_data_cl,
#   df11 = bond_mkty_rate_data_cl
# )
# 
# # Call the function to check for duplicates (without removing them)
# results <- check_and_handle_duplicates_all(dataframes_econ)
# 
# # Print summaries of duplicate rows
# print(results$DuplicateSummaries)
# 
# # If duplicates need to be removed, set `remove_duplicates = TRUE`
# results_with_cleaned_data <- check_and_handle_duplicates_all(dataframes_econ, remove_duplicates = TRUE)
# 
# # Access cleaned dataframes
# cleaned_dataframes <- results_with_cleaned_data$CleanedDataframes
# 


# Statistical Summary


# 5.1 Inflation
library(dplyr)
library(ggplot2)
library(plotly)  # Load plotly for interactivity

# Work from the raw dataset (note: this overwrites the cleaned copy created earlier)
infl_rate_data_cl <- infl_rate_data

# Convert the DATE column to Date type
infl_rate_data_cl$DATE <- as.Date(infl_rate_data_cl$DATE)

# Convert the CPI column (CPILFESL, core CPI) to numeric
infl_rate_data_cl$CPILFESL <- as.numeric(infl_rate_data_cl$CPILFESL)

# Order by date BEFORE computing lagged changes, so lag() compares consecutive months
infl_rate_data_cl <- infl_rate_data_cl %>%
  arrange(DATE) %>%
  # Month-to-Month inflation: percent change from the previous month
  mutate(Inflation_MoM = (CPILFESL / lag(CPILFESL) - 1) * 100) %>%
  # Year-over-Year inflation: percent change from the same month one year earlier
  mutate(Inflation_YoY = (CPILFESL / lag(CPILFESL, 12) - 1) * 100)
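As a quick check of the year-over-year formula, with made-up index values (a CPI of 310 today against 300 a year earlier):

```r
# Worked YoY example: index 310 now vs 300 a year earlier
cpi_now      <- 310
cpi_year_ago <- 300
round((cpi_now / cpi_year_ago - 1) * 100, 2)  # 3.33 (percent)
```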

# Create the CPI plot
cpi_plot <- ggplot(infl_rate_data_cl, aes(x = DATE, y = CPILFESL)) +
  geom_line(color = "blue") +
  labs(
    title = "CPI for All Urban Consumers: All Items Less Food and Energy",
    x = "Date",
    y = "CPI Index"
  ) +
  theme_minimal()

# Convert CPI plot to interactive plot
interactive_cpi_plot <- ggplotly(cpi_plot, tooltip = c("x", "y"))

# Create the Month-to-Month Inflation Rate plot
inflation_mom_plot <- ggplot(infl_rate_data_cl, aes(x = DATE, y = Inflation_MoM)) +
  geom_line(color = "red") +
  labs(
    title = "Month-to-Month Inflation Rate",
    x = "Date",
    y = "Inflation Rate (%)"
  ) +
  theme_minimal()

# Convert Month-to-Month Inflation Rate plot to interactive plot
interactive_mom_plot <- ggplotly(inflation_mom_plot, tooltip = c("x", "y"))

# Create the Year-over-Year Inflation Rate plot
inflation_yoy_plot <- ggplot(infl_rate_data_cl, aes(x = DATE, y = Inflation_YoY)) +
  geom_line(color = "green") +
  labs(
    title = "Year-over-Year Inflation Rate (Economist Method)",
    x = "Date",
    y = "Inflation Rate (%)"
  ) +
  theme_minimal()

# Convert Year-over-Year Inflation Rate plot to interactive plot
interactive_yoy_plot <- ggplotly(inflation_yoy_plot, tooltip = c("x", "y"))

# Display the interactive plots
interactive_cpi_plot
interactive_mom_plot
interactive_yoy_plot


# 5.1 Inflation: headline CPI (all items)
infl_rate_data_2_cl <- infl_rate_data_2
infl_rate_data_2_cl$observation_date <- as.Date(infl_rate_data_2_cl$observation_date)



# Convert the second column to numeric
infl_rate_data_2_cl[[2]] <- as.numeric(infl_rate_data_2_cl[[2]])

# Check if the conversion worked for numeric columns
#sapply(infl_rate_data_2_cl, is.numeric)

# Order by date 
infl_rate_data_2_cl <- infl_rate_data_2_cl %>%
  arrange(observation_date)

# Plot the line graph
ggplot(infl_rate_data_2_cl, aes(x = observation_date, y = CPIAUCSL)) +
  geom_line(color = "blue") +
  labs(title = "CPI for All Urban Consumers: All Items, U.S. City Average", x = "Date", y = "Index") +
  theme_minimal()



# 5.6 Bond market
bond_mkty_rate_data_cl <- bond_mkty_rate_data
bond_mkty_rate_data_cl$DATE <- as.Date(bond_mkty_rate_data_cl$DATE)

# Convert the second column to numeric
bond_mkty_rate_data_cl[[2]] <- as.numeric(bond_mkty_rate_data_cl[[2]])

# Check if the conversion worked for numeric columns
#sapply(bond_mkty_rate_data_cl, is.numeric)

# Order by date 
bond_mkty_rate_data_cl <- bond_mkty_rate_data_cl %>%
  arrange(DATE)

# Plot the line graph
ggplot(bond_mkty_rate_data_cl, aes(x = DATE, y = DGS10)) +
  geom_line(color = "blue") +
  labs(title = "Market Yield on U.S. Treasury Securities at 10-Year Constant Maturity", x = "Date", y = "Yield (%)") +
  theme_minimal()

# National mortgage rate
mort_rate_data_cl <- mort_rate_data
mort_rate_data_cl$DATE <- as.Date(mort_rate_data_cl$DATE)

# Convert the second column to numeric
mort_rate_data_cl[[2]] <- as.numeric(mort_rate_data_cl[[2]])

# Check if the conversion worked for numeric columns
#sapply(mort_rate_data_cl, is.numeric)

# Order by date 
mort_rate_data_cl <- mort_rate_data_cl %>%
  arrange(DATE)

# Plot the line graph
ggplot(mort_rate_data_cl, aes(x = DATE, y = mort_rate_data_cl[[2]])) +
  geom_line(color = "blue") +
  labs(title = "National 30-Year Mortgage Rate", x = "Date", y = "Rate (%)") +
  theme_minimal()


# Define a lookup table for county codes to county names
county_lookup <- c(
  '15001' = 'Hawaii_Cty',
  '15003' = 'Honolul_Cty',
  '15005' = 'Kalawao_Cty',
  '15007' = 'Kauai_Cty',
  '15009' = 'Maui_Cty'
)
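Indexing the named county_lookup vector above by character keys works like a small dictionary, which is how the recoding step further down replaces codes with names. A minimal self-contained illustration (with two of the entries repeated so the snippet runs on its own):

```r
# Named-vector lookup: index by character key to recode values
lookup <- c('15003' = 'Honolul_Cty', '15009' = 'Maui_Cty')
unname(lookup[c('15009', '15003', '15003')])
# "Maui_Cty" "Honolul_Cty" "Honolul_Cty"
```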

# Summarize the data to get the count of mortgage applications per year and county
mortapp_summary <- mortapp_hi %>%
  
  # Filter the data to include only rows with county codes in the lookup table
  filter(county_code %in% names(county_lookup)) %>%
  
  # Group by activity_year and county_code
  group_by(activity_year, county_code) %>%
  
  # Summarise the mortgage application count
  summarise(mortgage_application_count = n(), .groups = 'drop') %>%
  
  # Replace county codes with county names using the lookup table
  mutate(county_name = county_lookup[as.character(county_code)]) %>%
  
  # Drop the county_code column and spread the data so that each county becomes a column
  select(-county_code) %>%
  pivot_wider(names_from = county_name, values_from = mortgage_application_count, values_fill = 0) %>%
  
  # Add a total mortgage count column; exclude activity_year from the row sum
  # so the year itself is not added into the total
  rowwise() %>%
  mutate(total_mortgage_applications = sum(c_across(-activity_year))) %>%
  ungroup() %>%
  
  # Reorder columns to place 'total_mortgage_applications' after 'activity_year'
  select(activity_year, total_mortgage_applications, everything())


# Plotting a line graph for total mortgage applications
ggplot(mortapp_summary, aes(x = activity_year)) +
  geom_line(aes(y = total_mortgage_applications, color = "Total Applications")) +
  geom_line(aes(y = Hawaii_Cty, color = "Hawaii County")) +
  geom_line(aes(y = Honolul_Cty, color = "Honolulu County")) +
  geom_line(aes(y = Kauai_Cty, color = "Kauai County")) +
  geom_line(aes(y = Maui_Cty, color = "Maui County")) +
  
  # Add labels and title
  labs(title = "Mortgage Applications in Hawaii (2018-2022)",
       x = "Year",
       y = "Number of Applications",
       color = "County") +
  
  # Customize the theme
  theme_minimal()


dwelling_category_year_summary <- mortapp_hi %>%
  
  # Group by activity_year and derived_dwelling_category
  group_by(activity_year, derived_dwelling_category) %>%
  
  # Summarise the count of appearances of each category by year
  summarise(count = n(), .groups = 'drop') %>%
  
  # Pivot wider to make derived_dwelling_category the columns
  pivot_wider(names_from = derived_dwelling_category, values_from = count, values_fill = 0) %>%
  
  # Arrange by activity_year
  arrange(activity_year)

# Reshape the data back to long format for easier plotting
dwelling_category_year_long <- dwelling_category_year_summary %>%
  pivot_longer(cols = -activity_year, names_to = "derived_dwelling_category", values_to = "count")
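The wide-then-long round trip above exists to materialize explicit zero counts for category/year combinations that never occur in the data (tidyr::complete() is a more direct alternative). A minimal sketch with toy data:

```r
library(dplyr)
library(tidyr)

toy <- data.frame(year = c(2018, 2018, 2019),
                  cat  = c("A", "B", "A"),
                  n    = c(5, 2, 7))

# Wide-then-long round trip: absent combinations become explicit zeros
toy_long <- toy %>%
  pivot_wider(names_from = cat, values_from = n, values_fill = 0) %>%
  pivot_longer(-year, names_to = "cat", values_to = "n")

# The 2019/"B" combination, missing from the input, now appears with n = 0
```

Without this step, geom_line() would simply skip the missing year/category points instead of drawing them at zero.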

# Plotting with faceting for each dwelling category
ggplot(dwelling_category_year_long, aes(x = activity_year, y = count, group = derived_dwelling_category)) +
  geom_line() +
  
  # Facet the plot by dwelling category
  facet_wrap(~ derived_dwelling_category, scales = "free_y") +
  
  # Add labels and title
  labs(title = "Mortgage Applications by Dwelling Category (2018-2022)",
       x = "Year",
       y = "Number of Applications") +
  
  # Customize the theme
  theme_minimal()

# Select a starting point for the label on the national line (using the earliest date);
# named _price to distinguish it from the pending-days label object used later
ustart_point_price <- combined_data_price_cl %>%
  filter(!is.na(median_listing_price_country)) %>%
  slice_min(order_by = month_date_yyyymm, n = 1)

# Plot the line graph with each Hawaii county in a unique color and the national line in blue
ggplot() +
  geom_line(data = combined_data_price_cl, 
            aes(x = month_date_yyyymm, y = median_listing_price_hi, color = county_name, group = county_name), 
            size = 1) +
  geom_line(data = combined_data_price_cl, 
            aes(x = month_date_yyyymm, y = median_listing_price_country), 
            color = "blue", size = 1.2) +
  geom_text(data = ustart_point_price, 
            aes(x = month_date_yyyymm, y = median_listing_price_country, 
                label = "Median Listing Price Nationally"), 
            hjust = -0.1, vjust = -1.5, size = 4, fontface = "bold", color = "blue") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  # Adds monthly labels with abbreviated month and year
  labs(title = "Realty Prices Over Time by County in Hawaii",
       x = "Date",
       y = "Median Listing Price",
       color = "County") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1,size = 5))  # Rotates x-axis labels for readability

# Calculate the median price by date if not already done
median_price_by_date <- combined_data_price_cl %>%
  group_by(month_date_yyyymm) %>%
  summarize(Median_Listing_Price_HI = median(median_listing_price_hi, na.rm = TRUE))

# Convert DATE to Date format and filter to include only dates from 2016 onwards
hhi_income_data <- hhi_income_data_cl %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
  filter(date >= as.Date("2016-01-01"))

# Plotting
ggplot() +
  geom_line(data = median_price_by_date, 
            aes(x = month_date_yyyymm, y = Median_Listing_Price_HI / 1000, color = "Median Listing Price HI"), 
            size = 1) +  # Scale y-axis in thousands
  
  geom_line(data = hhi_income_data, 
            aes(x = date, y = income_hi / 1000, color = "Median Household Income HI"), 
            size = 1.2) +
  
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  scale_y_continuous(labels = scales::comma) +  # Comma formatting for y-axis
  
  labs(title = "Realty Prices Over Time vs. Household Income in Hawaii",
       x = "Date",
       y = "US Dollars (thousands)",
       color = "") +

  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5),
        plot.margin = unit(c(1, 1, 2, 1), "lines"))  # Extra bottom margin for annotation space



# Filter the data for Hawaii regions
hawaii_data_inv <- long_inventory %>%
  filter(StateName == "HI")

# Summarize the total inventory across all regions for each date
hawaii_summary <- hawaii_data_inv %>%
  group_by(Date) %>%
  summarise(Total_Inventory = sum(Inventory, na.rm = TRUE))

# Identify the starting point (first date) of the black line for annotation
start_point <- hawaii_summary %>%
  slice(1)  # First row (earliest date)

# Plot the line graph for all regions with a black summary line
ggplot() +
  # Layer 1: Line graph for each region
  geom_line(data = hawaii_data_inv, 
            aes(x = as.Date(Date), y = Inventory, color = RegionName, group = RegionName), 
            size = 1) +
  # Layer 2: Black summary line for total inventory
  geom_line(data = hawaii_summary, 
            aes(x = as.Date(Date), y = Total_Inventory), 
            color = "black", size = 1.2) +
  # Add a text label at the start of the black line
  geom_text(data = start_point, 
            aes(x = as.Date(Date), y = Total_Inventory, label = "Hawaii Total Inv"), 
            hjust = -0.1, vjust = 0, size = 4, fontface = "bold") +  # Adjust text position and size
  labs(
    title = "Housing Inventory Over Time by Region in Hawaii",
    x = "Date",
    y = "Inventory"
  ) +
  scale_x_date(
    date_breaks = "1 month",    # Show every month on the x-axis
    date_labels = "%b %Y"       # Format as "Jan 2024", "Feb 2024", etc.
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",  # Place legend at the bottom
    legend.title = element_blank(),  # Remove legend title
    axis.text.x = element_text(angle = 45, hjust = 1, size = 5)  # Rotate x-axis labels
  )



# Export the cleaned household income data for reference
write.csv(hhi_income_data, "hhi_income_data.csv", row.names = FALSE)


# Filter the data for United States only
us_data_inv <- long_inventory %>%
  filter(RegionName == "United States")

# Summarize the total inventory for United States by date
us_summary <- us_data_inv %>%
  group_by(Date) %>%
  summarise(Total_Inventory = sum(Inventory, na.rm = TRUE))

# Combine the US and Hawaii summaries
combined_summary <- bind_rows(
  us_summary %>% mutate(Region = "United States"),
  hawaii_summary %>% mutate(Region = "Hawaii")
)


# Plot with separate facets for each region
ggplot(combined_summary, aes(x = as.Date(Date), y = Total_Inventory, color = Region)) +
  geom_line(size = 1.2) +
  
  labs(
    title = "Housing Inventory Over Time: Hawaii vs. United States",
    x = "Date",
    y = "Total Inventory"
  ) +
  
  scale_x_date(
    date_breaks = "1 month",
    date_labels = "%b %Y"
  ) +
  
  facet_wrap(~ Region, scales = "free_y") +  # Separate y-axes for each region
  
  theme_minimal() +
  theme(
    legend.position = "none",  # Remove legend (since facets indicate regions)
    axis.text.x = element_text(angle = 45, hjust = 1, size = 5)  # Rotate x-axis labels
  )


rm(us_data_inv,us_summary)


# Create a copy of the long_pending data
cl_long_pending <- long_pending

# Convert 'Date' columns to a Date object
cl_long_pending$Date <- as.Date(cl_long_pending$Date, format = "%m.%d.%Y")
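as.Date() silently returns NA when the format string does not match, so it is worth confirming the pattern on a sample value before relying on it (hypothetical dates, not rows from this dataset):

```r
# "%m.%d.%Y" parses dot-separated month.day.year strings
as.Date("01.15.2024", format = "%m.%d.%Y")
# "2024-01-15"

# A mismatched format yields NA rather than an error
as.Date("2024-01-15", format = "%m.%d.%Y")
# NA
```

If every value in the Date column came back NA here, the format string would be the first thing to check.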

# Filter the data for Hawaii regions
hawaii_data_pnd <- cl_long_pending %>%
  filter(StateName == "HI" )

# Summarize the average days on market across all Hawaii regions for each date
hawaii_summary_pnd <- hawaii_data_pnd %>%
  group_by(Date) %>%
  summarise(Average_Pending = mean(Pending, na.rm = TRUE))

# Identify the starting point (first date) of the black line for annotation
start_point_pnd <- hawaii_summary_pnd %>%
  slice(1)

# Plot the line graph for all Hawaii regions with a black summary line representing the average pending days
ggplot() +
  geom_line(data = hawaii_data_pnd, 
            aes(x = Date, y = Pending, color = RegionName, group = RegionName), 
            size = 1) +
  geom_line(data = hawaii_summary_pnd, 
            aes(x = Date, y = Average_Pending), 
            color = "black", size = 1.2) +
  geom_text(data = start_point_pnd, 
            aes(x = Date, y = Average_Pending, label = "Hawaii Average Pending Days"), 
            hjust = -0.1, vjust = 1, size = 4, fontface = "bold") +  
  labs(
    title = "Average Pending Days Over Time by Region in Hawaii",
    x = "Date",
    y = "Pending Days"
  ) +
  scale_x_date(
    date_breaks = "1 month",    
    date_labels = "%b %Y"       
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",  
    legend.title = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 5)
  )

# Filter data for Hawaii and the country
Uhawaii_data_pnd <- cl_long_pending %>%
  filter(StateName == "HI" | RegionType == "country")

# Summarize the average pending days across all Hawaii regions for each date
Uhawaii_summary_pnd <- Uhawaii_data_pnd %>%
  filter(StateName == "HI") %>%  # Only Hawaii regions for the black line
  group_by(Date) %>%
  summarise(UAverage_Pending = mean(Pending, na.rm = TRUE))

# Filter out only the United States line
uus_data_pnd <- Uhawaii_data_pnd %>%
  filter(RegionType == "country")  # Assuming 'country' represents the United States

# Identify the starting point (first date) of the black line for annotation
ustart_point_pnd <- Uhawaii_summary_pnd %>%
  slice(1)

# Plot the line graph showing only the black summary line and the United States line
ggplot() +
  geom_line(data = uus_data_pnd, 
            aes(x = Date, y = Pending), 
            color = "blue", size = 1) +
  geom_line(data = Uhawaii_summary_pnd, 
            aes(x = Date, y = UAverage_Pending), 
            color = "black", size = 1.2) +
  geom_text(data = ustart_point_pnd, 
            aes(x = Date, y = UAverage_Pending, label = "Hawaii Average Pending Days"), 
            hjust = -2, vjust = -4, size = 4, fontface = "bold") +  
  # Optional: add annotation for the United States line
  geom_text(data = uus_data_pnd %>% slice(1), 
            aes(x = Date, y = Pending, label = "US Average Days Pending"), 
            hjust = -0.1, vjust = 5, size = 4, fontface = "bold", color = "blue") +  
  labs(
    title = "Average Pending Days Over Time: Hawaii vs. United States",
    x = "Date",
    y = "Pending Days"
  ) +
  scale_x_date(
    date_breaks = "1 month",    
    date_labels = "%b %Y"       
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",  
    legend.title = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 5)
  )


foreclosure90_data$DATE <- as.Date(foreclosure90_data$DATE, format = "%Y-%m-%d")


# Plot the line graph
ggplot(foreclosure90_data, aes(x = DATE, y = RCMFLBACTDPDPCT90P)) +
  geom_line(color = "red") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "3 months") +
  labs(title = "Large Bank Consumer Mortgage Balances: 90 or More Days Past Due",
       x = "Date", y = "Percent") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5))  # Rotate x-axis labels for readability



foreclosure_data$DATE <- as.Date(foreclosure_data$DATE, format = "%Y-%m-%d")


# Plot the line graph
ggplot(foreclosure_data, aes(x = DATE, y = DRSFRMACBS)) +
  geom_line(color = "blue") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "6 months") +
  labs(title = "Delinquency Rate on Single-Family Residential Mortgages",
       x = "Date", y = "Percent") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5))  # Rotate x-axis labels for readability



mort_rate_data_cl <- mort_rate_data
mort_rate_data_cl$DATE <- as.Date(mort_rate_data_cl$DATE)

# Convert the rate column to numeric and give it a descriptive name
mort_rate_data_cl[[2]] <- as.numeric(mort_rate_data_cl[[2]])
names(mort_rate_data_cl)[2] <- "mortgage_rate"

# Check that the conversion worked for numeric columns
sapply(mort_rate_data_cl, is.numeric)

# Order by date
mort_rate_data_cl <- mort_rate_data_cl %>%
  arrange(DATE)

# Plot the line graph (reference the column inside aes(), not the data frame itself)
ggplot(mort_rate_data_cl, aes(x = DATE, y = mortgage_rate)) +
  geom_line(color = "blue") +
  labs(title = "National Mortgage Rate", x = "Date", y = "Percent") +
  theme_minimal()
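as.numeric() coerces anything it cannot parse to NA (with a warning), which is what the sapply() check above guards against; some data exports mark missing observations with a placeholder such as "." (an assumption about this particular file, not something confirmed from it):

```r
# Strings that cannot be parsed as numbers become NA under as.numeric()
raw_vals <- c("6.5", "7.0", ".")   # "." as a missing-value marker is a hypothetical example
suppressWarnings(as.numeric(raw_vals))
# 6.5 7.0 NA
```

Rows that come back NA then drop out of geom_line() silently, so a count of NAs after conversion is a cheap sanity check.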





# 5.6 Bond market
bond_mkty_rate_data_cl <- bond_mkty_rate_data
bond_mkty_rate_data_cl$DATE <- as.Date(bond_mkty_rate_data_cl$DATE)

# Convert the yield column (DGS10) to numeric
bond_mkty_rate_data_cl$DGS10 <- as.numeric(bond_mkty_rate_data_cl$DGS10)

# Check that the conversion worked for numeric columns
sapply(bond_mkty_rate_data_cl, is.numeric)

# Order by date
bond_mkty_rate_data_cl <- bond_mkty_rate_data_cl %>%
  arrange(DATE)

# Plot the line graph
ggplot(bond_mkty_rate_data_cl, aes(x = DATE, y = DGS10)) +
  geom_line(color = "blue") +
  labs(title = "Market Yield on U.S. Treasury Securities at 10-Year Constant Maturity",
       x = "Date", y = "Percent") +
  theme_minimal()


